Keegan Mallaney

Importing required libraries

In [1]:
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

Dataset: https://www.kaggle.com/datasets/danbraswell/us-tornado-dataset-1950-2021

This dataset records tornadoes in the United States from 1950 through 2021.

From Kaggle: Origin

This dataset was derived from a dataset produced by NOAA's Storm Prediction Center. The primary changes made to create this dataset were the deletion of some columns, change of some data types, and sorting by date.

Column Definitions

yr - 4-digit year
mo - Month (1-12)
dy - Day of month
date - Datetime object (e.g. 1950-01-01)
st - State where tornado originated; 2-letter abbreviation
mag - F rating thru Jan 2007; EF rating after Jan 2007 (-9 if unknown rating)
inj - Number of injuries
fat - Number of fatalities
slat - Starting latitude in decimal degrees
slon - Starting longitude in decimal degrees
elat - Ending latitude in decimal degrees (value of 0 if missing)
elon - Ending longitude in decimal degrees (value of 0 if missing)
len - Length of track in miles
wid - Width in yards
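
The column definitions above use sentinel values (-9 for an unknown rating, 0 for a missing end point) rather than nulls, which is why a null check reports no missing data. A minimal sketch of converting those sentinels to NaN is shown here; the rows in `sample` are made up for illustration, and `cleaned` is a hypothetical name not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows in the dataset's column layout (illustration only).
sample = pd.DataFrame({
    "mag":  [3, -9, 1],
    "elat": [39.12, 0.0, 0.0],
    "elon": [-89.23, 0.0, 0.0],
})

cleaned = sample.copy()

# mask() replaces values with NaN wherever the condition is True.
cleaned["mag"] = cleaned["mag"].mask(cleaned["mag"] == -9)

ends = cleaned[["elat", "elon"]]
cleaned[["elat", "elon"]] = ends.mask(ends == 0)

# Aggregations now skip the unknown values instead of averaging in -9 or 0.
print(cleaned["mag"].mean())
print(cleaned["elat"].isna().sum())
```

With the sentinels converted, `isnull().sum()` would report the unknown ratings and missing end points directly.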

My goal is to identify which states and which times of year are most likely to see damaging tornadoes.

Importing and Previewing Data

In [2]:
df = pd.read_csv('US_Tornados_1950_2021.csv')
In [3]:
df.head()
Out[3]:
yr mo dy date st mag inj fat slat slon elat elon len wid
0 1950 1 3 1950-01-03 IL 3 3 0 39.10 -89.30 39.12 -89.23 3.6 130
1 1950 1 3 1950-01-03 MO 3 3 0 38.77 -90.22 38.83 -90.03 9.5 150
2 1950 1 3 1950-01-03 OH 1 1 0 40.88 -84.58 0.00 0.00 0.1 10
3 1950 1 13 1950-01-13 AR 3 1 1 34.40 -94.37 0.00 0.00 0.6 17
4 1950 1 25 1950-01-25 IL 2 0 0 41.17 -87.33 0.00 0.00 0.1 100

Creating a correlation matrix to evaluate any relationships

In [4]:
sns.heatmap(df.corr(numeric_only=True))
Out[4]:
<AxesSubplot:>
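
The correlation matrix only makes sense for numeric columns. One version-independent way to restrict it, sketched here on a few made-up rows, is to select the numeric columns explicitly before calling `corr()`. The `toy` frame and the commented-out heatmap options are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up rows (illustration only); 'st' is a text column.
toy = pd.DataFrame({
    "st":  ["IL", "MO", "OH", "AR"],
    "len": [3.6, 9.5, 0.1, 0.6],
    "wid": [130, 150, 10, 17],
    "fat": [0, 0, 0, 1],
})

# Keep only numeric columns so text columns never reach corr().
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))

# The matrix could then be drawn with annotations, for example:
# sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```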
In [5]:
df.shape
Out[5]:
(67558, 14)
In [6]:
df.isnull().sum()
Out[6]:
yr      0
mo      0
dy      0
date    0
st      0
mag     0
inj     0
fat     0
slat    0
slon    0
elat    0
elon    0
len     0
wid     0
dtype: int64
In [7]:
df.dtypes
Out[7]:
yr        int64
mo        int64
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object
In [8]:
df.nunique()
Out[8]:
yr         72
mo         12
dy         31
date    12300
st         53
mag         7
inj       209
fat        50
slat    14215
slon    16024
elat    15043
elon    16571
len      2429
wid       405
dtype: int64
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67558 entries, 0 to 67557
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   yr      67558 non-null  int64  
 1   mo      67558 non-null  int64  
 2   dy      67558 non-null  int64  
 3   date    67558 non-null  object 
 4   st      67558 non-null  object 
 5   mag     67558 non-null  int64  
 6   inj     67558 non-null  int64  
 7   fat     67558 non-null  int64  
 8   slat    67558 non-null  float64
 9   slon    67558 non-null  float64
 10  elat    67558 non-null  float64
 11  elon    67558 non-null  float64
 12  len     67558 non-null  float64
 13  wid     67558 non-null  int64  
dtypes: float64(5), int64(7), object(2)
memory usage: 7.2+ MB
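
The `date` column is read as `object` (plain strings). A minimal sketch of parsing it into real datetimes is shown here, which makes month names and decades available without manual mapping. The `toy` frame and the `month_name` and `decade` columns are hypothetical additions for illustration.

```python
import pandas as pd

# Made-up dates in the dataset's YYYY-MM-DD format (illustration only).
toy = pd.DataFrame({"date": ["1950-01-03", "1950-01-13", "2021-12-31"]})

# Parse the strings into datetime64 values.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# The .dt accessor exposes calendar fields derived from the parsed dates.
toy["month_name"] = toy["date"].dt.month_name()
toy["decade"] = toy["date"].dt.year // 10 * 10

print(toy)
```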

Converting to String in preparation for Month names

In [10]:
df['mo'] = df['mo'].astype(str)
In [11]:
df.dtypes
Out[11]:
yr        int64
mo       object
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object

Converting to Month names for visuals

In [12]:
df['mo'] = df['mo'].replace(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'], ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
In [13]:
df
Out[13]:
yr mo dy date st mag inj fat slat slon elat elon len wid
0 1950 January 3 1950-01-03 IL 3 3 0 39.1000 -89.3000 39.1200 -89.2300 3.60 130
1 1950 January 3 1950-01-03 MO 3 3 0 38.7700 -90.2200 38.8300 -90.0300 9.50 150
2 1950 January 3 1950-01-03 OH 1 1 0 40.8800 -84.5800 0.0000 0.0000 0.10 10
3 1950 January 13 1950-01-13 AR 3 1 1 34.4000 -94.3700 0.0000 0.0000 0.60 17
4 1950 January 25 1950-01-25 IL 2 0 0 41.1700 -87.3300 0.0000 0.0000 0.10 100
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67553 2021 December 30 2021-12-30 GA 1 0 0 31.1703 -83.3804 31.1805 -83.3453 2.19 150
67554 2021 December 30 2021-12-30 GA 1 0 0 31.6900 -82.7300 31.7439 -82.5412 11.71 300
67555 2021 December 31 2021-12-31 AL 1 0 0 34.2875 -85.7878 34.2998 -85.7805 0.95 50
67556 2021 December 31 2021-12-31 GA 1 0 0 33.7372 -84.9998 33.7625 -84.9633 2.75 150
67557 2021 December 31 2021-12-31 GA 1 6 0 33.5676 -83.9877 33.5842 -83.9498 2.50 75

67558 rows × 14 columns
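
Plain strings sort alphabetically, so month names stored as text can end up in the order April, August, December, and so on. One possible alternative, sketched here under the assumption of an English locale (`calendar.month_name` is locale-dependent), is an ordered categorical that keeps calendar order. The `toy` frame is made up for illustration.

```python
import calendar
import pandas as pd

# Made-up month numbers (illustration only).
toy = pd.DataFrame({"mo": [5, 1, 12, 5, 4]})

# calendar.month_name[0] is an empty string, so skip it.
month_names = list(calendar.month_name)[1:]

# Map 1..12 to names, then store them as an ordered categorical so that
# sorting and grouping follow calendar order rather than alphabetical order.
toy["mo"] = pd.Categorical(
    toy["mo"].map(lambda m: month_names[m - 1]),
    categories=month_names,
    ordered=True,
)

print(toy["mo"].sort_values().tolist())
```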

In [14]:
sns.set(rc={"figure.figsize":(15, 8)})
In [15]:
sns.barplot(data=df, x='mo', y='fat').set(title='Fatalities per month')
Out[15]:
[Text(0.5, 1.0, 'Fatalities per month')]
In [16]:
sns.barplot(data=df, x='mo', y='inj').set(title='Injuries per month')
Out[16]:
[Text(0.5, 1.0, 'Injuries per month')]

A similar monthly pattern appears for injuries.
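
Note that `sns.barplot` aggregates with the mean by default, so the bars above show average fatalities or injuries per tornado in each month, not monthly totals. The difference can be sketched with a plain pandas aggregation; the `toy` rows are made up for illustration.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "mo":  ["April", "April", "April", "May", "May"],
    "fat": [0, 0, 6, 1, 1],
})

# Mean per tornado, total per month, and number of tornadoes side by side.
per_month = toy.groupby("mo")["fat"].agg(["mean", "sum", "count"])
print(per_month)
```

To plot totals instead, a different aggregation function such as `sum` can be passed through the `estimator` argument of `sns.barplot`.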

In [17]:
stData = df.groupby('st').sum(numeric_only=True)
In [18]:
stData.head()
Out[18]:
yr dy mag inj fat slat slon elat elon len wid
st
AK 7972 61 0 0 0 236.9000 -630.5000 236.9000 -630.5000 0.20 10
AL 4706753 38632 2381 8672 665 77548.7157 -204722.1012 59192.3210 -155354.7776 13047.41 412780
AR 3808313 29740 2119 5408 400 66753.3221 -176617.5860 48549.2136 -128049.1970 11356.39 348494
AZ 537125 4341 104 152 3 9148.7256 -30158.2005 4222.5598 -13712.7914 527.35 19709
CA 921157 7223 159 90 0 16787.2859 -55287.0377 9499.5146 -31056.3938 520.80 19524
In [19]:
stData['fat'].value_counts()[:10]
Out[19]:
0      11
1       4
4       4
2       2
194     2
438     1
28      1
60      1
31      1
226     1
Name: fat, dtype: int64
Selecting the ten states with the most fatalities
In [20]:
# Ten largest values in column fat
deadliestST = stData.nlargest(10, ['fat'])
In [21]:
deadliestST.head()
Out[21]:
yr dy mag inj fat slat slon elat elon len wid
st
AL 4706753 38632 2381 8672 665 77548.7157 -204722.1012 59192.3210 -155354.7776 13047.41 412780
TX 18188289 149133 5096 9412 591 293386.8538 -901295.0097 147586.7371 -451737.2735 22182.26 846104
MS 4940920 40654 2587 6452 476 80361.5682 -221997.9981 59312.3935 -163382.8908 14794.95 451986
OK 8133230 65898 2641 5997 438 145490.8846 -398575.9309 85184.3134 -232743.2105 15508.93 593659
TN 2656272 20272 1412 4936 407 47652.9264 -115638.7184 35960.8603 -86890.5627 6690.46 212440
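
Summing every numeric column also adds up years, days, and coordinates, which have no meaning as totals. A minimal sketch of restricting the aggregation to the columns of interest before ranking is shown here; the `toy` rows and the names `totals` and `top` are hypothetical.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "st":  ["AL", "TX", "AL", "MS", "TX"],
    "yr":  [1950, 1950, 1951, 1951, 1952],
    "inj": [3, 0, 10, 2, 1],
    "fat": [1, 0, 4, 2, 0],
})

# Sum only the columns where a total is meaningful.
totals = toy.groupby("st")[["inj", "fat"]].sum()

# Rank by fatalities and turn the state index back into a column.
top = totals.nlargest(2, "fat").reset_index()
print(top)
```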
In [22]:
deadliestST.reset_index(inplace=True)
In [23]:
deadliestST.columns
Out[23]:
Index(['st', 'yr', 'dy', 'mag', 'inj', 'fat', 'slat', 'slon', 'elat', 'elon',
       'len', 'wid'],
      dtype='object')
In [24]:
sns.barplot(data=deadliestST, x='st', y='fat').set(title='Fatalities in the top 10 deadliest states')
Out[24]:
[Text(0.5, 1.0, 'Fatalities in the top 10 deadliest states')]
In [25]:
sns.barplot(data=deadliestST, x='st', y='inj').set(title='Injuries in the top 10 deadliest states')
Out[25]:
[Text(0.5, 1.0, 'Injuries in the top 10 deadliest states')]
Importing plotly express module for choropleth
In [26]:
import plotly.express as px
import plotly.graph_objects as go
In [27]:
fig = px.choropleth(deadliestST,
                   locations='st',
                   locationmode="USA-states",
                   scope="usa",
                   color='fat',
                   color_continuous_scale="YlOrRd"
                   )

fig.update_layout(
        title_text = 'Top Ten States by Tornado Fatalities, 1950-2021')

fig.show()
Above is an interactive map showing fatalities in the top 10 fatality states.
In [28]:
stMap = stData
In [29]:
stMap = stMap.reset_index()
In [30]:
stMap.columns
Out[30]:
Index(['st', 'yr', 'dy', 'mag', 'inj', 'fat', 'slat', 'slon', 'elat', 'elon',
       'len', 'wid'],
      dtype='object')
In [31]:
fig = px.choropleth(stMap,
                   locations='st',
                   locationmode="USA-states",
                   scope="usa",
                   color='fat',
                   color_continuous_scale="YlOrRd"
                   )

fig.update_layout(
        title_text = 'Tornado Fatalities per State, 1950-2021')

fig.show()
Above is an interactive map showing fatalities per state.
In [32]:
df.head()
Out[32]:
yr mo dy date st mag inj fat slat slon elat elon len wid
0 1950 January 3 1950-01-03 IL 3 3 0 39.10 -89.30 39.12 -89.23 3.6 130
1 1950 January 3 1950-01-03 MO 3 3 0 38.77 -90.22 38.83 -90.03 9.5 150
2 1950 January 3 1950-01-03 OH 1 1 0 40.88 -84.58 0.00 0.00 0.1 10
3 1950 January 13 1950-01-13 AR 3 1 1 34.40 -94.37 0.00 0.00 0.6 17
4 1950 January 25 1950-01-25 IL 2 0 0 41.17 -87.33 0.00 0.00 0.1 100
In [33]:
df.dtypes
Out[33]:
yr        int64
mo       object
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object
Counting the number of tornadoes recorded in each state.
In [34]:
stCounts = df.value_counts('st').reset_index()
In [35]:
stCounts.head()
Out[35]:
st 0
0 TX 9149
1 KS 4375
2 OK 4092
3 FL 3497
4 NE 2967
In [36]:
stCounts.columns
Out[36]:
Index(['st', 0], dtype='object')
In [37]:
stCounts = stCounts.rename(columns={0: "Count"})
In [38]:
stCounts.dtypes
Out[38]:
st       object
Count     int64
dtype: object
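
The name of the counts column produced by `value_counts().reset_index()` depends on the pandas version (0 in the output above; as far as I know, pandas 2.0 and later use `count`). A sketch that names the column explicitly, and so does not need the rename step, is shown here with made-up rows.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({"st": ["TX", "KS", "TX", "OK", "TX", "KS"]})

# size() counts rows per state; reset_index(name=...) names the counts column.
stCounts = (
    toy.groupby("st")
       .size()
       .sort_values(ascending=False)
       .reset_index(name="Count")
)
print(stCounts)
```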
In [39]:
#sns.barplot(data=stCounts, x='Count', y='st').set(title='Tornado counts per state')
In [40]:
torCounts = stCounts['Count']
states = stCounts['st']

fig = plt.figure(figsize = (20,10))

plt.bar(states, torCounts)

plt.title('Tornado counts per state')
plt.ylabel('Count')
plt.xlabel('State')

plt.show()
In [41]:
moCounts = df.value_counts("mo")
moCounts.columns = ['mo', 'count']  # moCounts is a Series, so this renames nothing
In [42]:
moCounts.head(12)
Out[42]:
mo
May          14818
June         12492
April         9573
July          6971
August        4788
March         4514
September     3471
October       2802
November      2647
February      1945
December      1818
January       1719
dtype: int64
In [43]:
moCounts.columns
Out[43]:
['mo', 'count']
In [44]:
sns.lineplot(data=moCounts).set(title='Tornado instances per month')
Out[44]:
[Text(0.5, 1.0, 'Tornado instances per month')]
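
Because `value_counts` sorts by frequency, the line plot above runs from the busiest month to the quietest rather than from January to December. A minimal sketch of putting the counts in calendar order and converting them to a two-column table is shown here; `month_order` and the `toy` rows are assumptions for illustration.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Made-up rows (illustration only).
toy = pd.DataFrame({"mo": ["May", "May", "June", "January", "May", "June"]})

moCounts = (
    toy["mo"].value_counts()
       .reindex(month_order, fill_value=0)   # calendar order, 0 for absent months
       .rename_axis("mo")
       .reset_index(name="count")            # Series -> DataFrame with named columns
)
print(moCounts)
```

The resulting frame can be passed to `sns.lineplot(data=moCounts, x="mo", y="count")` so the x-axis follows the calendar.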
With Texas near the top for fatalities, injuries, and tornado count, it is worth looking at Texas specifically.
In [45]:
TX = df[df['st'] == 'TX']
In [46]:
TX.head()
Out[46]:
yr mo dy date st mag inj fat slat slon elat elon len wid
6 1950 January 26 1950-01-26 TX 2 2 0 26.88 -98.12 26.88 -98.05 4.7 133
7 1950 February 11 1950-02-11 TX 2 0 0 29.42 -95.25 29.52 -95.13 9.9 400
8 1950 February 11 1950-02-11 TX 2 5 0 32.35 -95.20 32.42 -95.20 4.6 100
9 1950 February 11 1950-02-11 TX 2 6 0 32.98 -94.63 33.00 -94.70 4.5 67
10 1950 February 11 1950-02-11 TX 3 12 1 29.67 -95.05 29.83 -95.00 12.0 1000
In [47]:
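# Note: df is the full dataset, so this repeats moCounts; use TX for Texas only.
txCounts = df.value_counts("mo")
txCounts.columns = ['mo', 'count']  # txCounts is a Series, so this renames nothing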
In [48]:
txCounts.head(12)
Out[48]:
mo
May          14818
June         12492
April         9573
July          6971
August        4788
March         4514
September     3471
October       2802
November      2647
February      1945
December      1818
January       1719
dtype: int64
In [49]:
txCounts.columns
Out[49]:
['mo', 'count']
In [50]:
txCounts.dtypes
Out[50]:
dtype('int64')
In [51]:
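# Note: df is the full dataset, so these are national totals; use TX for Texas only.
txMoCounts = df.groupby('mo').sum(numeric_only=True)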
In [52]:
txMoCounts.head(12)
Out[52]:
yr dy mag inj fat slat slon elat elon len wid
mo
April 19070543 161950 8750 30103 1863 342892.5693 -8.786775e+05 234770.5950 -596096.1246 49238.74 1440833
August 9533227 75922 2284 2967 128 185718.8583 -4.389518e+05 101695.6803 -235473.7705 9870.59 348604
December 3628253 29335 1818 4470 280 62386.1304 -1.656981e+05 46658.6122 -122608.2432 9370.05 230605
February 3874228 31893 1989 6550 471 65147.9998 -1.746559e+05 43713.6637 -115702.1465 9678.83 260687
January 3431080 27620 1550 2864 171 57136.2030 -1.548184e+05 42466.7059 -114081.6532 7231.04 231177
July 13867895 103018 3547 2304 73 280468.3024 -6.462901e+05 142898.2403 -321788.8878 13303.95 502237
June 24847824 181156 7400 9636 569 492467.7621 -1.184246e+06 265686.0307 -626006.8624 31986.68 964279
March 8990220 77662 4217 10732 776 157129.9508 -4.117682e+05 106167.0159 -276270.5484 23736.07 674297
May 29507277 247811 8433 17892 1313 550775.5595 -1.405238e+06 341478.0055 -860179.7299 49667.61 1640330
November 5275431 42592 2702 5325 269 91334.2088 -2.377734e+05 62655.3987 -162300.4312 13168.46 331375
October 5592567 45240 1874 2468 102 98997.3655 -2.563770e+05 70688.1423 -181269.1689 8816.60 307695
September 6912512 51393 2137 1829 97 124812.1600 -3.138490e+05 76762.2293 -188062.0112 8921.07 268012
In [53]:
sns.lineplot(data=txMoCounts, x="mo", y="fat").set(title='Texas Tornado fatalities per month')
Out[53]:
[Text(0.5, 1.0, 'Texas Tornado fatalities per month')]
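
The monthly tables above are built from the full `df`, so they describe all states. A sketch of the Texas-only version, filtering first and then grouping, is shown here on made-up rows; `tx_only`, `tx_by_month`, and `all_by_month` are hypothetical names.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "st":  ["TX", "TX", "KS", "TX", "OK"],
    "mo":  ["May", "May", "May", "April", "April"],
    "fat": [2, 1, 7, 0, 3],
})

# Filter to Texas before aggregating.
tx_only = toy[toy["st"] == "TX"]

# Named aggregation: tornado count and total fatalities per month.
tx_by_month = tx_only.groupby("mo")["fat"].agg(tornadoes="count", fatalities="sum")

# The same grouping on the unfiltered data gives national totals instead.
all_by_month = toy.groupby("mo")["fat"].sum()

print(tx_by_month)
print(all_by_month)
```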
Removing rows where the tornado's magnitude is unknown (mag of -9).
In [54]:
tornadoMag = df[df.mag != -9].copy()  # copy so later changes do not act on a slice of df
In [58]:
tornadoMag.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66953 entries, 0 to 67557
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   yr      66953 non-null  int64  
 1   mo      66953 non-null  object 
 2   dy      66953 non-null  int64  
 3   date    66953 non-null  object 
 4   st      66953 non-null  object 
 5   mag     66953 non-null  int64  
 6   inj     66953 non-null  int64  
 7   fat     66953 non-null  int64  
 8   slat    66953 non-null  float64
 9   slon    66953 non-null  float64
 10  elat    66953 non-null  float64
 11  elon    66953 non-null  float64
 12  len     66953 non-null  float64
 13  wid     66953 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 7.7+ MB
In [62]:
# Keep only tornadoes rated F3/EF3 or higher.
tornadoMag = tornadoMag[tornadoMag['mag'] >= 3]
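
A `SettingWithCopyWarning` typically appears when a frame obtained by filtering is later modified in place. A minimal sketch of one way to avoid it is shown here: take an explicit copy once, then narrow the data with boolean masks that return new frames. The `toy` rows and the names `known`, `strong`, and `violent` are hypothetical.

```python
import pandas as pd

# Made-up rows (illustration only); -9 marks an unknown rating.
toy = pd.DataFrame({
    "st":  ["IL", "MO", "OH", "TX", "LA"],
    "mag": [3, -9, 1, 5, 4],
})

# Explicit copy: 'known' is an independent frame, not a slice of 'toy'.
known = toy[toy["mag"] != -9].copy()

# Boolean masks return new frames, so nothing is modified in place.
strong = known[known["mag"] >= 3]      # F3/EF3 and above
violent = strong[strong["mag"] >= 5]   # F5/EF5 only

print(len(known), len(strong), len(violent))
```

Keeping each threshold in its own variable also means the F3+ data is still available after the F5 subset is created.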

In [63]:
tornadoMag.head()
Out[63]:
yr mo dy date st mag inj fat slat slon elat elon len wid
0 1950 January 3 1950-01-03 IL 3 3 0 39.10 -89.30 39.12 -89.23 3.6 130
1 1950 January 3 1950-01-03 MO 3 3 0 38.77 -90.22 38.83 -90.03 9.5 150
3 1950 January 13 1950-01-13 AR 3 1 1 34.40 -94.37 0.00 0.00 0.6 17
10 1950 February 11 1950-02-11 TX 3 12 1 29.67 -95.05 29.83 -95.00 12.0 1000
15 1950 February 12 1950-02-12 LA 3 25 5 31.63 -93.65 32.55 -93.03 74.5 100
In [64]:
tornadoMag.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3176 entries, 0 to 67394
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   yr      3176 non-null   int64  
 1   mo      3176 non-null   object 
 2   dy      3176 non-null   int64  
 3   date    3176 non-null   object 
 4   st      3176 non-null   object 
 5   mag     3176 non-null   int64  
 6   inj     3176 non-null   int64  
 7   fat     3176 non-null   int64  
 8   slat    3176 non-null   float64
 9   slon    3176 non-null   float64
 10  elat    3176 non-null   float64
 11  elon    3176 non-null   float64
 12  len     3176 non-null   float64
 13  wid     3176 non-null   int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 372.2+ KB
In [66]:
fig = go.Figure(data=go.Scattergeo(
        lon = tornadoMag['slon'],
        lat = tornadoMag['slat'],
        text = tornadoMag['yr'],
        mode = 'markers',
        marker_color = tornadoMag['mag'],
        ))


fig.update_layout(
        title = 'F3 or greater Tornado touchdown points',
        geo_scope='usa',
    )

fig.show()
In [67]:
# Keep only tornadoes rated F5/EF5.
tornadoMag = tornadoMag[tornadoMag['mag'] >= 5]

In [74]:
fig = go.Figure(data=go.Scattergeo(
        lon = tornadoMag['slon'],
        lat = tornadoMag['slat'],
        text = tornadoMag['yr'],
        mode = 'markers',
        marker_color = tornadoMag['mag'],
        ))


fig.update_layout(
        title = 'F5 Tornado touchdown points',
        geo_scope='usa',
    )

fig.show()
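
The maps show where strong tornadoes touched down. To connect that back to the stated goal (which state and which time of year), one possible summary is a state-by-month table of strong tornado counts. This is a sketch on made-up rows, and `by_state_month` is a hypothetical name.

```python
import pandas as pd

# Made-up rows (illustration only); -9 marks an unknown rating.
toy = pd.DataFrame({
    "st":  ["TX", "OK", "TX", "AL", "OK", "TX"],
    "mo":  ["May", "May", "April", "April", "May", "May"],
    "mag": [3, 5, 4, 3, 1, -9],
})

# mag >= 3 keeps F3/EF3 and above and also excludes the -9 sentinel.
strong = toy[toy["mag"] >= 3]

# Rows are states, columns are months, cells are counts of strong tornadoes.
by_state_month = pd.crosstab(strong["st"], strong["mo"])
print(by_state_month)
```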